The population of the United States is shaped by centuries of migration, isolation, growth, and admixture between ancestors of global
origins. Here, we assemble a comprehensive view of recent population history by studying the ancestry and population structure of 
more than 32,000 individuals in the US using genetic, ancestral birth origin, and geographic data from the National Geographic Geno- 
graphic Project. We identify migration routes and barriers that reflect historical demographic events. We also uncover the spatial pat- 
terns of relatedness in subpopulations through the combination of haplotype clustering, ancestral birth origin analysis, and local 
ancestry inference. Examples of these patterns include substantial substructure and heterogeneity in Hispanics/Latinos, isolation- 
by-distance in African Americans, elevated levels of relatedness and homozygosity in Asian immigrants, and fine-scale structure in 
individuals of European descent. Taken together, our results provide detailed insights into the genetic structure and demographic history of the
diverse US population.

Introduction 


The United States population is a diverse collection of 
global ancestries shaped by migration from distant conti- 
nents and admixture of recent migrants and Native Amer- 
icans. Throughout the past few centuries, continuous 
migration and gene flow have played major roles in 
shaping the diversity of the US. Mixing between groups
that have historically been genetically and spatially
distinct has resulted in individuals with complex ancestries,
while within-country migration has led to genetic
differentiation.

Deeply characterizing population history is important 
for understanding human evolution and demographic his- 
tory, as well as for adequate study design when associating 
genotypes to phenotypes. Earlier population genetic
studies in the US broadly characterized this structure, typically
using a limited set of ancestry-informative markers or
uniparental mtDNA and Y chromosome DNA data. As
the cost of genetic technologies has dropped, more recent
studies have inferred population history with more complete
genome-wide data, typically using more than
100,000 SNPs ascertained via sequencing or genotyping.

Previous genetic studies of the US population have 
sought to infer genetic ancestry and population 
history primarily in European Americans, African Americans,
and Hispanics/Latinos. European American
ancestry is characterized by substantial mixing between
different ancestral European populations and, to a lesser
extent, admixture with non-European populations.
Isolation among certain European populations, such as the
Ashkenazi Jewish, French Canadian, and Finnish populations,
has also resulted in founder effects. The
mixing of European settlers with Native Americans has
contributed to large variations in the admixture proportions
of different Hispanic/Latino populations. Among
Hispanics/Latinos, Mexicans and Central Americans have
more Native American ancestry; Puerto Ricans and Dominicans
have more African ancestry; and Cubans have more
European ancestry. In African Americans, proportions
of African, European, and Native American ancestry vary
across the country and reflect migration routes, slavery,
and patterns of segregation between states.
Although much effort has been made to understand the
genetic diversity in the US, fine-scale patterns of demography,
migration, isolation, and founder effects are still being
uncovered with the growing scale of genetic data,
particularly for Latin American and African descendants
with complex admixture histories. At the same time,
there has been little research on the population structure
of individuals with East Asian, South Asian, and Middle
Eastern ancestry in the US.

Many previous studies have investigated specific popula- 
tion histories in the US at relatively small scales—on the 
order of hundreds to thousands of individuals. These 
studies have provided deep insights into many specific 
populations, with some well-powered to infer population 
history across a breadth of ancestries. Some of these 
insights have been made by applying methods that are
computationally tractable only at smaller scales.
More recently, however, important insights highlight the 
need for broader and more comprehensive investigations 
of population history. For example, recent studies have 
shown that population structure is inaccurately captured 
in small sample sizes. Additionally, millions of Americans
have been interested enough in their genetic ancestry
to pay direct-to-consumer companies for individual-level
genetic ancestry reports. The reliability of these reports
is high for many individuals, but they are dependent on 
(1) the representativeness of their reference panel or 
customer database, (2) completeness and accuracy of 
multigenerational birth origin data, and (3) the application 
of multiple approaches to gain holistic insights into popu- 
lation history. 

In this study, we comprehensively evaluate the popula- 
tion history of more than 32,000 genotyped individuals 
in the US who partook in the National Geographic Geno- 
graphic Project, a not-for-profit public participation 
research initiative to study human migration history. 
This project has several distinct advantages compared to 
other large-scale population genetics datasets. Participants 
were genotyped with the GenoChip, a validated array of 
~150,000 markers designed for genetic anthropology 
that excludes medically related SNPs to protect the health 
privacy of participants. Individual-level genetic data are
accessible to researchers around the world to answer 
anthropological questions. Additionally, most participants 
report birthplace and ethnicity data for themselves, their 
parents, and their grandparents, enabling fine-scale insights
into recent history. Furthermore, participants reported
their postal code when they participated in the study,
enabling analysis of intragenerational migration. These
data therefore enable high spatiotemporal resolution into 
historical migration patterns. While these trends are 
consistent with US history at the population scale, we 
note that genetic ancestry patterns are not commensurate 
with individual-level ethnicity (i.e., cultural identity). 

Here, we leverage these advantages over existing data to 
identify patterns of genetic ancestry by studying pairwise 
sharing among the project participants. We combine these 
comparative patterns with ancestral birth origin records 
and geographic information to uncover recent demo- 
graphic and migration trends. By comprehensively 
analyzing these data to learn about recent migration 
events, we gain deeper insights into ancestral origins 
than in many existing studies, especially into Latin Amer- 
ica. We also provide early insights into Asian Americans 
often ignored in genetic studies of the US, including South 
Asians, East Asians, and Middle Easterners. We also identify 
detailed patterns among European and African American 
populations, recapitulating some similar trends reported 
previously. Taken together, we use accessible individual- 
level genetic and birth record data to provide insights 
into the ancestral origins and complex population his- 
tories in the US. 


Material and Methods 


Human Subjects 

The Genographic Project and Geno 2.0 Project received full 
approval from the Social and Behavioral Sciences Institutional Re- 
view Board (IRB) at the University of Pennsylvania Office of Regu- 
latory Affairs on April 12, 2005. The IRB operates in compliance 
with applicable laws, regulations, and ethical standards necessary 
for research involving human participants. All DNA samples 
included in this study came from customers of the National 
Geographic Genographic Project, who have consented to have 
their results used in scientific research. To participate in the Geno- 
graphic Project, participants would first order a DNA Ancestry Kit 
through the Genographic Project website. To ensure anonymity, 
each DNA Ancestry Kit is encoded with a randomly generated, 
nonsequential, Genographic Participant ID number. Prior to 
providing a sample, participants must read an IRB-approved con- 
sent form and provide written consent. Participants would then 
give a saliva sample, mix the saliva sample with a stabilization 
buffer solution, and return it along with their completed consent 
form via postal mail. DNA was then extracted from the saliva sample
and genome-wide genotyping was performed (see Genotyping and
Quality Control). Once participants obtain their results, they can 
voluntarily provide an additional separate consent on the Geno- 
graphic Project website to make their genotype data anonymously 
available for qualified anthropological and genetic research. 

In addition to providing a DNA sample, participants also pro- 
vided geographic location (postal code) data and, optionally, fam- 
ily history information in the form of ancestral birth origin and 
ethnicity (up to grandparental level). All data of individuals who 
consented to research were deidentified prior to inclusion in
the Genographic research database. We limited our study to those 
individuals who provided valid geographic locations in the United 
States. Approximately 75% of individuals selected provided com- 
plete pedigrees and family history data (see Supplemental Material 
and Methods for further detail). 


Genotyping and Quality Control 

Participants in the Genographic Project were genotyped with the
GenoChip array, an Illumina iSelect HD custom genotyping
bead array with approximately 150,000 ancestry-informative
markers. It excludes medically related markers
to protect the health privacy of participants and to minimize the
improper translation of direct-to-consumer genetic ancestry results
to clinical care. The ability of the GenoChip array to discern
subpopulations was validated by producing concordant ancestry
patterns with samples from the 1000 Genomes Project and by
demonstrating similar FST distributions and higher mean FST values
when compared to the Affymetrix Axiom Human Origins array
(used in HGDP-CEPH) and the Illumina Human660W-Quad BeadChip.
Raw genotype data were quality controlled (QC) using
PLINK v1.90b3.39. We filtered to keep samples with < 0.1 missingness,
sites with = 0.0 missingness, and MAF > 0.05. A total of
32,589 individuals and 108,003 SNPs passed quality control.


Ancestry Reference Panels 

We leveraged a variety of reference populations to help better infer 
and interpret the genetic ancestry, admixture proportions, and 
population structure in the Genographic cohort. Data from the 
1000 Genomes Project were used to help identify genetic ancestry
and estimate admixture proportions. A total of 108,003 SNPs were shared
between the Genographic samples and the 1000 Genomes Project
samples. We also used data from the Population Reference Sample
(POPRES) to help understand the population structure of individuals
with European ancestry in the Genographic cohort. All
analyses with the POPRES data were limited to the 46,710 SNPs
that are shared between the two datasets. We also leveraged
recently released sequence data for the Human Genome Diversity
Project (HGDP) to expand the available set of ancestral populations
from Asia. All analyses using the HGDP data were performed
using the 105,944 SNPs shared between the samples in the
Genographic Project and HGDP.


Principal Component Analysis 

We performed principal component analysis (PCA) on the quality-controlled
samples using FlashPCA v.2.0. For PCA of all Genographic
Project individuals, we used the genotypes of all 2,504
individuals from the 1000 Genomes Project as reference samples.
We first computed PCs across the 108,003 shared sites for 1000 Genomes
Project individuals. We then projected the Genographic
Project individuals onto the same principal component space using
the flag: --project.

For PCA analyses of East Asian and South Asian populations, we 
used samples from 1000 Genomes Project that correspond to the 
East Asian and South Asian super populations. Similar to above, 
we first computed PCs for the 1000 Genomes Project samples separately
for East Asians and South Asians. We then projected East
Asian and South Asian Genographic Project individuals onto the 
respective principal component space using: --project. 


Continental Ancestry Assignment 

We assigned continental ancestry to each Genographic sample by 
using a random forest classifier. Using the PCs and known super 
population assignments (AFR, African; EUR, European; EAS, East 
Asian; AMR, admixed American; and SAS, South Asian) from the 
1000 Genomes Project samples as training data, we applied the clas- 
sifier to assign ancestry to each Genographic sample at 90% proba- 
bility. We considered unassigned ancestries as “other” (OTH). 
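
The thresholding step can be illustrated with a minimal sketch. It assumes the classifier has already produced one probability per super population for each sample; the function name, the example probabilities, and the use of "greater than or equal to" at the 90% cutoff are illustrative assumptions, not details taken from the study.

```python
def assign_ancestry(class_probs, threshold=0.90):
    """Return the most probable label, or "OTH" if it does not reach the threshold."""
    label, prob = max(class_probs.items(), key=lambda item: item[1])
    return label if prob >= threshold else "OTH"


# Hypothetical per-class probabilities for two samples (not real data).
confident = {"AFR": 0.02, "EUR": 0.95, "EAS": 0.01, "AMR": 0.01, "SAS": 0.01}
ambiguous = {"AFR": 0.30, "EUR": 0.45, "EAS": 0.00, "AMR": 0.25, "SAS": 0.00}

print(assign_ancestry(confident))
print(assign_ancestry(ambiguous))
```

Under this rule, a highly admixed sample whose probability mass is spread over several classes falls into OTH rather than being forced into a single continental label.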


Comparison of Continental Ancestry Assignment with 
Self-Reported Data 

To evaluate the concordance between continental ancestry assignments
based on genetics and self-reported ethnicity, we standardized
self-reported ancestral ethnicities and estimated the proportion
of assigned individuals within each continental ancestry
group who have at least one grandparent with a matching continental
ancestry. Since ancestral ethnicity data were provided as
free text and were therefore not standardized across participants,
we manually cleaned and mapped the reported ethnicities
to continent-level ancestries. For example, African ancestry
can include a country (e.g., Jamaican, Nigerian, Cape Verdean), 
an ethnic group (e.g., Amhara or Tigray from Ethiopia), a historical 
term used to describe African descendants in America (e.g., Melun- 
geon, Maroon, Mulatto), or the commonly used terms of African 
American or Black. 
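
As a rough sketch of this concordance measure, the snippet below computes, for one genetically assigned group, the fraction of individuals with at least one grandparent whose cleaned ethnicity maps to the same continental label. The lookup table, the toy cohort, and the choice to skip individuals with no mappable entries are assumptions made for illustration.

```python
def concordance(individuals, continent_of):
    """Fraction of individuals with at least one grandparent of matching ancestry.

    individuals: list of (assigned_label, [grandparent ethnicity strings]).
    continent_of: dict mapping a cleaned, lower-case ethnicity to a label.
    """
    reported = matched = 0
    for assigned, ethnicities in individuals:
        mapped = {continent_of.get(e.strip().lower()) for e in ethnicities if e}
        mapped.discard(None)
        if not mapped:
            continue  # assumption: no usable self-reported data, so skip
        reported += 1
        matched += assigned in mapped
    return matched / reported if reported else float("nan")


# Hypothetical mapping and cohort (not real data).
continent_of = {"nigerian": "AFR", "african american": "AFR",
                "german": "EUR", "irish": "EUR"}
cohort = [
    ("AFR", ["Nigerian", "Nigerian", "German", "Irish"]),
    ("AFR", ["German", "German", "Irish", "Irish"]),
    ("AFR", ["", "", "", ""]),
]
print(concordance(cohort, continent_of))
```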


Genetic Ancestry Proportion Estimation 

We estimated admixture proportions using ADMIXTURE by first 
analyzing samples from the 1000 Genomes Project in unsupervised
mode to learn allele frequencies. Then, we projected the
learned allele frequencies onto the Genographic samples to obtain
the admixture proportions using the flag: -P. We ran ADMIXTURE
with K = 2-9 and chose K = 5 as the most stable representation
based on cross-validation.

For the analysis of East Asian and South Asian individuals, we combined
samples from HGDP and the 1000 Genomes Project to build more
comprehensive reference panels. Specifically, we combined 1000 
Genomes Project populations under the East Asian (EAS) super pop- 
ulation label with HGDP samples that have the East Asian and Oce- 
ania region label, and we combined 1000 Genomes Project samples 
under the South Asian super population label with Central South 
Asia labeled populations in HGDP. Similar to above, we first ran 
ADMIXTURE on the ancestral reference panels for East Asians and 
South Asians, separately. We then projected the learned allele
frequencies onto the Genographic samples to obtain admixture
proportions using the flag: -P. We tested a range of cluster numbers,
K = 2-9, and chose K = 4 for East Asians and K = 3 for South Asians as the
most stable representations.


UMAP 

We applied the Uniform Manifold Approximation and Projection
(UMAP) method to visualize subcontinental structure. We first
combined the PCs of the Genographic samples and the 1000 Genomes
Project samples into one dataset. We then applied UMAP
to the first 20 PCs from the joint dataset to produce a two-dimensional
plot. We tested various parameter choices for UMAP and found
that the default nearest neighbor value of 15 and a minimum distance
value of 0.5 delivered the clearest result. Coloring of UMAP
plots is described in the Supplemental Material and Methods.

We further examined the subcontinental structure of Genographic
Project individuals who were classified as having European
ancestry using data from the Population Reference
Sample (POPRES). Similar to the analyses with the 1000 Genomes
Project data, we performed dimensionality reduction
with PCA and UMAP, keeping the same parameter values. POPRES
data were colored by regional grouping: Southeast
Europe = Croatia, Yugoslavia, Bosnia-Herzegovina, Serbia,
Romania, Hungary, Albania, Macedonia; Central Europe =
Switzerland, France, Germany, Swiss-Italian, Belgium,
Swiss-French, Netherlands, Swiss-German; British Isles = Scotland,
Ireland, United Kingdom; South Europe = Italy, Cyprus, Turkey,
Greece; Iberian = Portugal, Spain; Eastern Europe = Austria, Czech
Republic, Poland, Russia; Scandinavia = Sweden, Norway.


Phasing and Haplotype Estimation 

Genographic genotypes were phased with the Sanger Imputation 
Service using EAGLE2 and the Haplotype Reference Consortium
reference panel. No genotype imputation was performed.


fineSTRUCTURE Analysis 

For classified East Asian individuals and South Asian individuals, 
we inferred clusters of unrelated individuals with shared ancestries 
by applying the fineSTRUCTURE framework v.4.0.1, a model-based
approach to estimate patterns of haplotype similarity
and identify clusters of discrete populations. We performed
fineSTRUCTURE analysis separately for the two populations. The 
first part of the fineSTRUCTURE framework uses ChromoPainter 
to measure shared ancestry between individuals and estimate a 
coancestry matrix. This matrix is then used in fineSTRUCTURE’s 
clustering and tree-building algorithm to hierarchically cluster
individuals from fine levels of structure to broader levels. We first
applied ChromoPainter to phased genotypes to estimate the number
of contiguous segments (chunks) shared and the total amount of
genome (in cM) shared between each pair of individuals within
each population, as well as the normalization parameter (c). Using
the coancestry matrix and normalization parameter, we then ran
fineSTRUCTURE with 2 million Markov chain Monte Carlo
(MCMC) iterations, of which 1 million were “burn-in” iterations,
sampling every 2,000 iterations. Finally, we used
fineSTRUCTURE to infer a hierarchical tree using 100,000 hill-climbing
moves. We used the scripts accompanying the
fineSTRUCTURE software as well as the ape package in R to visu- 
alize the coancestry matrix and dendrogram results. 

To characterize the inferred clusters, we examined structure at
both broad and fine scales. There is
no definitively correct level of the dendrogram to pick for examination.
We examined clades at various levels of the tree and assessed
broad structure at the levels in which clades had a sufficient
number of individuals (on average 50 or more samples). We
further used a combination of PCA and analysis of ancestral origins
to assess and define these clades. Some of the clusters are
small but genetically distinct, as evidenced by the branch length
and height of the split (e.g., Girmitiyas, Bangladesh), and
they were therefore kept as separate clades.

Unlike traditional PCA, PCA using the coancestry matrix (i.e., the
chunk counts matrix) can better discern fine-scale population
structure and provide greater interpretability. We performed
PCA on the chunk counts matrix using the Python library
scikit-learn. Individual markers are colored and labeled based on their
respective grouping.


Estimating Effective Migration Surfaces 

We estimated migration and diversity relative to geographic dis- 
tance using the estimating effective migration surfaces (EEMS) 
method for Genographic Project individuals that were classified 
under African, European, and admixed American ancestries.
We excluded East Asian and South Asian ancestries due to low 
sample size and density. We used unrelated individuals with avail- 
able postal code data. We first computed pairwise genetic dissimi- 
larities with the EEMS bed2diffs tool and then ran EEMS with 
runeems_snps, setting the number of demes to 250 and to 500. 
Per the recommendation in the manual, we adjusted the proposal
variances for the diversity, migration, and degree-of-freedom
parameters such that proposals were accepted 10%-40% of
the time. We increased the number of Markov chain Monte Carlo
(MCMC) iterations until the chain converged.

To evaluate the robustness of EEMS to sampling bias, we simu- 
lated three different sampling schemes. We used individuals clas- 
sified with African ancestry as it is the smallest of the three ances- 
tries and therefore more likely to be impacted by sampling bias. In 
the first sampling scheme, we randomly subsampled individuals 
to 80% of the original sample size. In the second scheme, we 
used the US Census Regions assignments for states and explored 
the impact of even sampling across the four major Census Re- 
gions. We subsampled African Americans so that each Census 
Region was represented in equal proportions. In the last sampling
scheme, we explored the scenario of overrepresentation in the
South by subsampling at 80%, this time with half of the subsamples
drawn from the South and the remaining samples
evenly distributed across the other three regions.
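
The three sampling schemes can be sketched as follows. This is a minimal illustration on synthetic sample IDs, not the procedure used in the study: the function names are hypothetical, the even scheme is assumed to be sized by the smallest region, and `random.sample` raises `ValueError` if a region has too few individuals for the requested draw.

```python
import random
from collections import defaultdict

REGIONS = ("South", "Northeast", "Midwest", "West")


def by_region(samples):
    """Group sample IDs by Census Region; samples is a list of (id, region)."""
    groups = defaultdict(list)
    for sample_id, region in samples:
        groups[region].append(sample_id)
    return groups


def random_subsample(samples, fraction=0.8, rng=None):
    """Scheme 1: uniform random subsample of the whole cohort."""
    rng = rng or random
    ids = [sample_id for sample_id, _ in samples]
    return rng.sample(ids, round(fraction * len(ids)))


def even_subsample(samples, rng=None):
    """Scheme 2: equal counts per region, limited by the smallest region."""
    rng = rng or random
    groups = by_region(samples)
    n = min(len(groups[r]) for r in REGIONS)
    return [s for r in REGIONS for s in rng.sample(groups[r], n)]


def south_heavy_subsample(samples, fraction=0.8, rng=None):
    """Scheme 3: half from the South, the rest split evenly over three regions."""
    rng = rng or random
    groups = by_region(samples)
    total = round(fraction * len(samples))
    n_south = total // 2
    n_other = (total - n_south) // 3
    picked = rng.sample(groups["South"], n_south)
    for region in REGIONS[1:]:
        picked += rng.sample(groups[region], n_other)
    return picked


# Synthetic cohort: 40 individuals in the South and 20 in each other region.
samples = [(f"{r}_{i}", r) for r, n in zip(REGIONS, (40, 20, 20, 20)) for i in range(n)]
rng = random.Random(0)
print(len(random_subsample(samples, rng=rng)),
      len(even_subsample(samples, rng=rng)),
      len(south_heavy_subsample(samples, rng=rng)))
```

Because the non-South remainder is split with integer division, the third scheme can return slightly fewer individuals than the nominal 80% target.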


Haplotype Calling and Network Construction 
We used IBDSeq v.r1206 to generate shared identity-by-descent
(IBD) segments from genotype data for all unrelated individuals.
Unlike other IBD detection algorithms, IBDSeq does not
rely on phased genotype data and is less susceptible to switch errors
in phasing that can cause erroneous haplotype breaks. We filtered
for IBD segments greater than 3 cM. We removed segments that
overlapped with long chromosomal regions (1 Mb) that had no
SNPs across all unrelated individuals. These regions can result in
false-positive IBD sharing and likely correspond to centromeres and
telomeres. We calculated the cumulative IBD sharing between individuals
by summing the lengths of all shared IBD segments. We
then constructed a haplotype network of unrelated individuals
by defining vertices as individuals and edge weights between
vertices as the cumulative IBD sharing between individuals.
We retained edges with cumulative IBD sharing >12 cM
and <72 cM, as previously described.
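
The filtering and edge-weight steps can be sketched in a few lines. The input format (one tuple per IBD segment with two sample IDs and a length in cM) and the function name are assumptions for illustration; removal of segments overlapping SNP-free regions is not modeled here.

```python
from collections import defaultdict


def build_ibd_network(segments, min_segment_cm=3.0,
                      min_total_cm=12.0, max_total_cm=72.0):
    """Return {(id_a, id_b): cumulative_cM} for edges passing both filters."""
    totals = defaultdict(float)
    for id1, id2, length_cm in segments:
        if length_cm > min_segment_cm:  # keep segments greater than 3 cM
            pair = tuple(sorted((id1, id2)))  # undirected edge
            totals[pair] += length_cm
    # keep edges with cumulative sharing strictly between 12 and 72 cM
    return {pair: total for pair, total in totals.items()
            if min_total_cm < total < max_total_cm}


# Hypothetical segments (not real data).
segments = [
    ("a", "b", 5.0), ("b", "a", 8.5),  # a-b: 13.5 cM in total, kept
    ("a", "c", 2.0),                   # segment too short, dropped
    ("c", "d", 80.0),                  # cumulative sharing too high, dropped
    ("b", "c", 4.0),                   # cumulative sharing too low, dropped
]
print(build_ibd_network(segments))
```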


Detection of IBD Clusters 

While fineSTRUCTURE can identify population structure in admixed
cohorts using haplotype similarity, it does
not scale to large sample sizes and is not recommended for more
than 10,000 samples. We therefore sought to identify clusters of related
individuals in the haplotype network using the Louvain Method
implemented in the igraph package for R. The Louvain Method is a greedy
iterative algorithm that assigns vertices of a graph into clusters to
optimize modularity (a measure of the density of edges within
communities relative to edges between communities). The Louvain
Method begins by assigning each node to its own community.
For each node i, it then evaluates the change in modularity from
moving i into each neighboring community j and places i in the
community that maximizes modularity. The algorithm repeats this
process and terminates when no vertices can be reassigned.
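
To make the quantity being optimized concrete, the sketch below computes weighted modularity for a given partition using the standard Newman definition, Q = sum over communities of (internal weight / 2m) - (community degree / 2m)^2. It is not the igraph implementation and does not perform the Louvain search itself; the function name and toy graph are illustrative, and self-loops are assumed to be absent.

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity of an undirected graph without self-loops.

    edges: iterable of (u, v, weight); community: dict node -> community id.
    """
    two_m = 0.0
    internal = defaultdict(float)  # twice the edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for u, v, w in edges:
        two_m += 2 * w
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += 2 * w
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridge edge, unit weights.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
print(modularity(edges, split), modularity(edges, merged))
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which is the kind of comparison a greedy modularity optimizer makes at each move.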

We partitioned the haplotype network into clusters by recur- 
sively applying the Louvain Method within subcommunities. At 
the highest level, we take the full, unpartitioned haplotype graph 
and identify a set of subcommunities. We isolate the vertices 
within each subcommunity, keeping only the edges between 
those vertices to create separate new networks. We then apply 
the Louvain Method to the new subgraphs. We repeat this process 
up to four levels. We combined subcommunities with low genetic 
divergence based on FST values of < 0.0001.


Annotation of IBD Clusters 

We used a combination of ancestral birth origins and self-reported 
ethnicities to discern demographic characteristics of each cluster. 
For each cluster, we quantified the proportion of each birth origin 
(i.e., country of origin) among all four grandparents, treating each 
grandparent’s origin equally. We used these proportions to inform
population labels. Clusters in which a single non-US birth origin
was present in high proportion were labeled with that country. In cases
where multiple non-US birth locations exist in approximately
equally high proportions, we assigned a label representing the 
broader region (e.g., Eastern Europeans for Poland, Lithuania, 
Ukraine, and Slovakia; East Asia for Japan, China). For certain clus- 
ters, annotations could not be easily discerned by birth origin data. 
In these cases, we relied on self-reported ethnicities to label the clus- 
ters as these populations were found to be less associated with a non- 
US country (e.g., Ashkenazi Jews) or the population has resided in the 
US for generations (e.g., African Americans, Acadians). 
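
The birth-origin tally behind these labels can be sketched as below. Each grandparent contributes one equally weighted count; the function name, the toy cluster, and the decision to skip missing origins are assumptions, and no labeling threshold is applied because the text does not state one.

```python
from collections import Counter


def origin_proportions(cluster_members):
    """Proportion of each grandparental birth origin within one cluster.

    cluster_members: list of per-individual lists of up to four origins.
    """
    counts = Counter()
    for grandparent_origins in cluster_members:
        for origin in grandparent_origins:
            if origin:  # assumption: missing origins are skipped
                counts[origin] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {origin: n / total for origin, n in counts.most_common()}


# Hypothetical cluster of three individuals (not real data).
cluster = [
    ["Poland", "Poland", "Poland", "Poland"],
    ["Poland", "Poland", "Lithuania", "Lithuania"],
    ["United States", "United States", "United States", "United States"],
]
print(origin_proportions(cluster))
```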


Runs of Homozygosity 
We used PLINK v.1.90b3.39 to infer runs of homozygosity with a 
window of 25 SNPs. By default, PLINK reports only the runs of
homozygosity with lengths > 1 Mb. For each individual, we calculated
the sum of runs of homozygosity (sROH) by summing the
lengths of homozygous segments. We compared ROH segments
inferred by PLINK with homozygosity-by-descent (HBD) segments 
inferred using IBDSeq. The two approaches largely agreed in ROH 
lengths (Spearman correlation = 0.94; p = 7.24 x 1071), with the 
exception that the median sROH lengths for the Greece-Italy and 
Italy clusters were lower in IBDSeq, while the median sROH length
for East Asians was higher in IBDSeq when compared to PLINK
(Table S1). 
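
The per-individual sROH statistic is a simple sum, sketched here under an assumed input format of one (sample ID, start bp, end bp) tuple per homozygous run. The length filter mirrors the > 1 Mb reporting behavior described above; the function name and example coordinates are illustrative.

```python
from collections import defaultdict


def sum_roh(segments, min_length_mb=1.0):
    """Return {sample_id: total ROH length in Mb} for runs longer than the cutoff."""
    totals = defaultdict(float)
    for sample_id, start_bp, end_bp in segments:
        length_mb = (end_bp - start_bp) / 1e6
        if length_mb > min_length_mb:
            totals[sample_id] += length_mb
    return dict(totals)


# Hypothetical runs (not real data).
runs = [
    ("s1", 1_000_000, 3_500_000),    # 2.5 Mb
    ("s1", 10_000_000, 11_200_000),  # 1.2 Mb
    ("s2", 0, 800_000),              # 0.8 Mb, below the cutoff
]
print(sum_roh(runs))
```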


Local Ancestry Inference 

We inferred local ancestry with RFMix v.1.5.4 for Genographic 
samples in clusters that were annotated as Hispanics/Latinos and 
African Americans. We used samples of African (LWK, MSL,
GWD, YRI, ESN, ACB, and ASW; n = 661), European (CEU, GBR, 
FIN, IBS, and TSI; n = 503), and Native American (MXL, PUR, 
CLM, and PEL; n = 347) ancestry from the 1000 Genomes Project 
to build the reference panel for classifying genomic segments. We 
ran RFMix with the default minimum window size (0.2 centimorgans,
cM) and a node size of 5 with the flags: -w 0.2, -n 5. We then
collapsed the output of RFMix, which denotes the classified
ancestry of each SNP for each individual, into local ancestry
segments/tracts (in cM) for each individual. We then derived global
ancestry proportions for each individual from that individual’s
local ancestry tracts: we summed the lengths of local ancestry tracts
for each ancestry (EUR, AFR, AMR) and divided by the total length of
the genome to get the global proportion of each ancestry. Global
ancestry proportions were visualized using the python-ternary 
package in Python (see Web Resources). 
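
The collapse from per-window calls to tracts, and from tracts to global proportions, can be sketched as follows. The input format (sorted windows with cM coordinates and one ancestry label each, for a single haplotype) is an assumption rather than the actual RFMix output format, and the denominator here is the total length covered by the supplied windows.

```python
from collections import defaultdict
from itertools import groupby


def collapse_tracts(windows):
    """Merge consecutive windows with the same ancestry into tracts.

    windows: list of (start_cM, end_cM, ancestry), sorted along a chromosome.
    """
    tracts = []
    for ancestry, run in groupby(windows, key=lambda w: w[2]):
        run = list(run)
        tracts.append((run[0][0], run[-1][1], ancestry))
    return tracts


def global_proportions(tracts):
    """Share of total tract length assigned to each ancestry."""
    lengths = defaultdict(float)
    for start, end, ancestry in tracts:
        lengths[ancestry] += end - start
    total = sum(lengths.values())
    return {ancestry: length / total for ancestry, length in lengths.items()}


# Hypothetical calls along one 100 cM haplotype (not real data).
windows = [(0, 10, "AFR"), (10, 25, "AFR"), (25, 40, "EUR"),
           (40, 50, "AFR"), (50, 100, "AMR")]
tracts = collapse_tracts(windows)
print(tracts)
print(global_proportions(tracts))
```

For a whole genome, the same sum would run over the tracts of all chromosomes and both haplotypes before dividing.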


Genetic Divergence 

We computed weighted Weir-Cockerham FST estimates for each
pair of haplotype clusters using PLINK v.1.90b3.39. Using the
distance matrix of FST values between clusters, we constructed an
unrooted phylogenetic tree using the neighbor joining method 
implemented in scikit-bio (see Web Resources). We visualized the 
tree using Interactive Tree of Life (see Web Resources). 


Effective Population Size 

We estimated effective population size with IBDNe. Using the
inferred IBD segments between individuals for each cluster, we ran 
IBDNe in default mode separately for each cluster to infer the 
effective population size over time along with confidence 
intervals. 


Results 


Genetic Ancestry and Diversity across the United States 

To assess the diversity of ancestries among individuals in 
the Genographic Project, we first performed principal 
component analysis, projecting Genographic samples 
into the same principal component (PC) space as that of 
the 1000 Genomes Project samples (Figures 1A-1C, S1, 
and S2). Since self-reported ancestry was not consistently
provided across all Genographic Project individuals,
we leveraged the 1000 Genomes Project data to assign con- 
tinental ancestry to each Genographic sample (Material 
and Methods). We first trained a random forest classifier
on the first 10 PCs of the 1000 Genomes Project samples
with super population groupings as ancestry labels (EUR,
European; AMR, Admixed American; AFR, African; EAS,
East Asian; SAS, South Asian). We then used the trained
model to assign continental ancestry to each individual
in the Genographic cohort at > 90% confidence. A total of
3,028 individuals (9.3% of total) did not meet the classifi- 
cation threshold (Table S2). The inability to classify these 
individuals may be due to variable levels of admixture 
not reflected in the 1000 Genomes reference populations. 
No particular bias was found in the ancestral birth origin 
records for these individuals, as the top non-US origins 
are Germany (3.0%), Italy (2.6%), Poland (2.5%), UK 
(2.5%), and Mexico (2.0%). Overall, the assigned continen- 
tal ancestry was largely consistent with the self-reported 
ancestral ethnicity, as 95% of classified African-ancestry 
individuals and 85% of classified Hispanic-ancestry indi- 
viduals who reported ancestral data had at least one grand- 
parent of that ancestry (Material and Methods). 

Regional differences in genetic ancestry correspond to 
historical demographic trends. We evaluated the distribu- 
tions of classified individuals across the four designated 
US Census regions: South, Northeast, Midwest, and West 
(Table S2). Classified individuals of European descent 
make up the majority (78.5%) of the Genographic cohort 
and are the most prevalent in the Midwest (82.8% of indi- 
viduals in the Midwest; p < 0.01, Fisher’s exact test; Table 
S2). Admixed American ancestry individuals are most 
prominent in the West and South (9.7% and 7.8% of total 
individuals in the West and South, respectively; p < 0.05, 
Fisher’s exact test). Individuals classified as having African 
ancestry are most common in the South (3.2%), followed 
by the Northeast (3.0%). East Asians mostly reside in the 
West (2.1%), while South Asians are most abundant in 
the Northeast (1.0%). While the proportion of individuals
classified as being of European descent in the Genographic
cohort (78.5%) is similar to the proportion of individuals
reported as “White” in the US Census data (76.1%; Table
S3), we note that genetic ancestry is not a direct measure 
of ethnicity and race, and the two are not fully comparable 
(Supplemental Material and Methods). The large propor- 
tion of unclassified individuals also hinders our ability to 
properly compare the Genographic cohort to the US 
Census and understand how representative the Geno- 
graphic cohort is of the US population. Overall, the distri- 
bution of Genographic Project participants by state reflects 
the US population distribution reported in the Census 
(Spearman’s ρ = 0.91, p = 1.5 x 10~”°; Figure S3). However,
the states of Washington, California, Virginia, Maryland, 
and Colorado have higher proportions (>1% difference) 
of participants when compared to the US population distri- 
bution while Texas and Ohio have lower proportions of 
participants (Table S4). For certain ancestries, some ascer- 
tainment bias exists. For example, individuals with African 
ancestry are overrepresented in California but are absent in 
Idaho, Maine, Nebraska, North Dakota, South Dakota, and 
Wyoming. 
Figure 1. Genetic Diversity of the US
(A) Principal Components Analysis (PCA) of individuals in the United States and in
the 1000 Genomes Project. Each individual is represented by a single dot. Individuals
purple). Principal components (PC) 1 and

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To uncover population substructure, we performed 
dimensionality reduction with Uniform Manifold Approx- 
imation and Projection (UMAP) on the first 20 PCs of a 
combined Genographic and 1000 Genomes Project 
dataset.’’”** By leveraging multiple PCs at once, UMAP 
can disentangle subcontinental structure (Figures 1D, 1E, 
S4, and $5). Similar to a previous analysis,** populations 
in the 1000 Genomes Project form distinct clusters 
corresponding to ancestry and geography. The Geno- 
graphic Project individuals project into several clusters, 
overlapping with the 1000 Genomes Project clusters. 
Consistent with the PCA and ADMIXTURE analysis, the 
largest clusters correspond to European ancestry and 
cluster closely with the 1000 Genomes CEU and GBR pop- 
ulations (CEU = Utah Residents with Northern and West- 
ern European Ancestry, GBR = British in England and 
Scotland). 

While UMAP is a visualization tool with no direct 
interpretation of genetic distance, the continuum of 


he Asian 
‘$F Americans 


= UMAP1- 


assigned a continent-level ancestry label 
using a random forest model trained on 
the super population labels and the first 
10 PCs of the 1000 Genomes Project 
dataset. OTH, individuals who did not 
meet the 90% confidence threshold for 
classification. 

(D and E) UMAP projection of the first 20 
PCs. Each dot represents one individual. 
In (D), individuals in the 1000 Ge- 
nomes Project are colored by population 
while US individuals from the National 
Geographic Genographic Project are in 
East gray. In (E), 1000 Genomes Project individ- 
uals are colored in gray while US individ- 
uals from the National Geographic 
Genographic Project are colored based 
on their admixture proportions from 
ADMIXTURE. The color for each dot was 
calculated as a linear combination of each 
individual’s admixture proportion and 
the RGB values for the colors assigned to 
each continental ancestry (EUR, red; AFR, 
yellow; NAM, green; EAS, blue; SAS, pur- 
ple). See Material and Methods for specific 
population labels. 


SAS OTH 


European 
Americans 


points connecting UMAP clusters reflects the varying de- 
grees of estimated admixture between different conti- 
nental ancestries. In particular, the complex population 
structure of Hispanics/Latinos is shown by the points 
spanning between the clusters of European, Native 
American, and African ancestry. Coloring of these points 
based on ancestry proportions affirms the relationship 
between the degree of admixture and their relative 
position between reference clusters. Interestingly, 
African American individuals from both datasets form a 
single continuum from the European cluster to the 
Yoruba (YRI) and Esan (ESN) populations of Nigeria in 
the 1000 Genomes Project, indicative of the West 
African origins of most African Americans. This observa- 
tion is consistent with and further expands the 
previous finding that the African tracts in the ad- 
mixed 1000 Genomes Project populations of ACB and 
ASW are similar to the Nigerian YRI and ESN 
populations.”’*” 
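
The coloring by ancestry proportions described for Figure 1E is a linear combination of per-ancestry colors. The sketch below illustrates that idea only. The specific RGB triples, the dictionary name `ANCESTRY_RGB`, and the function name `blend_color` are assumptions made for this example, since the text names the colors but not their numeric values.

```python
# Assumed RGB values for the colors named in the Figure 1 caption
# (EUR, red; AFR, yellow; NAM, green; EAS, blue; SAS, purple).
ANCESTRY_RGB = {
    "EUR": (255, 0, 0),
    "AFR": (255, 255, 0),
    "NAM": (0, 255, 0),
    "EAS": (0, 0, 255),
    "SAS": (128, 0, 128),
}


def blend_color(proportions):
    """Mix ancestry colors weighted by admixture proportions.

    `proportions` maps ancestry labels to non-negative weights; they are
    renormalized so that they sum to 1 before mixing.
    """
    total = sum(proportions.values())
    if total <= 0:
        raise ValueError("admixture proportions must sum to a positive value")
    rgb = [0.0, 0.0, 0.0]
    for ancestry, weight in proportions.items():
        base = ANCESTRY_RGB[ancestry]
        for channel in range(3):
            rgb[channel] += (weight / total) * base[channel]
    return tuple(round(value) for value in rgb)


# A hypothetical individual with 80% EUR and 20% AFR ancestry.
example_color = blend_color({"EUR": 0.8, "AFR": 0.2})
```

Under these assumed colors, an individual with mostly European and some African ancestry is drawn in a red shifted toward orange, which is why admixed individuals appear as color gradients between the reference clusters.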
Figure 2. Population Structure of East Asian and South Asian Individuals in the US 

(A and B) fineSTRUCTURE dendrogram showing the hierarchical relationship between clusters inferred using the genotypes of classified East Asian individuals (A) and South Asian individuals (B). Branch colors represent clades with shared ancestral origins. The admixture proportion of each individual is displayed as a bar plot in the corresponding position below the dendrogram. The number of ancestral populations, K, is four for East Asians (A) and three for South Asians (B). 

(C and D) Principal component analysis (PCA) of the fineSTRUCTURE co-ancestry matrix. Each individual (point) corresponds to a Genographic Project individual classified as either East Asian (C) or South Asian (D). The color of each point corresponds to a clade in the fineSTRUCTURE dendrogram shown in (A) and (B). 


Fine-Scale Structure among US Individuals of Asian 
Ancestry 

Existing genetic studies of the US population have largely 
overlooked East Asian and South Asian populations, likely 
due to their underrepresentation in datasets. We therefore 
explored the population structure of Genographic Project 
individuals classified as East Asians and South Asians. We 
used fineSTRUCTURE to first estimate patterns of haplo- 
type similarities between individuals, taking into consider- 
ation linkage disequilibrium, and then hierarchically clus- 
tered individuals based on these patterns of shared 
ancestry to identify clusters of populations and their rela- 
tionships. We applied fineSTRUCTURE to unrelated 
individuals in each population and inferred a total of 40 
East Asian clusters (Figure 2A) and 26 South Asian clusters 
(Figure 2B). These clusters further organized into clades 
on the tree to reveal broader genetic structure. To 
visualize these structures, we performed PCA on the 
fineSTRUCTURE coancestry matrix. Compared to tradi- 
tional PCA, distinctions between groups of individuals 
were clearer with fineSTRUCTURE PCA, particularly at 
the broader levels of genetic differentiation (Figures 2C 
and S6A; Figures 2D and S7A; Material and Methods). We 
also estimated subcontinental admixture proportions 
with ADMIXTURE using the East Asian and South Asian 


populations in the 1000 Genomes Project and the Human 
Genome Diversity Project (HGDP) as reference popula- 
tions (Figures S6B, S6C, S7B, and S7C). Finally, we lever- 
aged data from individuals who provided grandparental 
birth origin to help annotate and interpret the clusters 
and clades. 

The patterns of shared ancestry among these US individ- 
uals capture the genetic diversity of East Asia and South 
Asia. The East Asian clusters broadly organize into six ma- 
jor clades, reflecting the different countries of ancestral 
origin (Figure 2A). At the highest level of genetic differen- 
tiation (top level of the hierarchical tree), individuals from 
Southeast Asia separate from East Asians. This Southeast 
Asian clade is predominantly represented by Filipinos 
with a branch of individuals with more Oceanic origins 
(shown in gray and yellow, respectively). Admixture pro- 
portions vary among the Southeast Asian individuals, 
likely due to the large number of ethnolinguistic groups 
that are found in the Philippines and neighboring islands. 
The East Asian clade further separates into individuals of 
Chinese descent (light blue and dark blue) and those 
from Japan (dark red) and Korea (light red). While the 
two Chinese-related groups share a branch on the tree, 
Taiwanese ancestral origins are more prevalent in one of 
the groups (dark blue), the “China (+ Taiwan)” group, 
Figure 3. Effective Migration Rates of African Americans, Hispanics/Latinos, and Europeans within the United States. 
Migration rates inferred with EEMS for African Americans (A), Hispanics/Latinos (B), and Europeans (C). EEMS models the relationship between genetics and geography by assessing the decay of genetic similarity with respect to geographic distance. Colors and values correspond to inferred rates, m, relative to the overall migration rate across the country. Shades of blue indicate higher migration (i.e., log(m) = 1 represents effective migration that is 10-fold faster than the average) and higher levels of genetic similarity while shades of orange indicate migration barriers and lower levels of genetic similarity. 


while the other group (light blue), labeled “Southern 
China,” also contains some individuals from Laos and 
Vietnam. Lower levels of hierarchy did not differentiate 
these ancestral origins into separate groups. PCA and 
ADMIXTURE analysis for these two groups show that the 
China (+ Taiwan) cluster resembles the Han Chinese 
(CHB) population in the 1000 Genomes Project while 
the Southern China group resembles the Southern Han 
Chinese (CHS) population (Figure S6). Among the South 
Asian individuals, we observed genetic differentiation be- 
tween individuals with ancestral origins from India, re- 
flecting the diverse population structure previously 
observed in India.'** Of the three clades with majority 
Indian ancestral origin, ancestral origins from Pakistan 
were observed in the “India (+ Pakistan)” clade, while Sri 
Lankan ancestral origins were present in the “India (+ Sri 
Lanka)” clade. Individuals in these two clades resemble 
the Punjabi from Lahore, Pakistan (PJL) and Sri Lankan 
Tamil (STU) populations in the 1000 Genomes Project, 
respectively (Figure S7). Similarly, we also find a clade of in- 
dividuals with Bangladesh ancestral origins that is similar 
to the 1000 Genomes Project Bengali from Bangladesh 
(BEB). Interestingly, we also inferred a small, but geneti- 
cally distinct “Girmitiyas” clade (N = 12; blue branch in 
Figure 2B). While the small sample size makes it difficult 
to accurately assess this clade, we note that many former 
British colonies (e.g., Trinidad and Tobago, Fiji, Barbados, 
Guyana) are represented in the ancestral origins of these 
individuals. We therefore hypothesize that these individuals 
may be descendants of Girmitiyas, indentured 
Indian laborers brought to those former colonies.*” 


Population Differentiation and Migration Rate Inference 
across the United States 

Understanding the relationship between genetics and ge- 
ography can provide insights into demographic history. 
Previous analyses of this relationship in the US population 
have primarily compared data aggregated at the state or 
regional level.” Such approaches, however, do not cap- 
ture the fine-scale patterns of genetic similarity that are 
not influenced by discrete political boundaries. We there- 
fore sought to infer population structure across continuous 
space with the estimating effective migration surfaces 
(EEMS) method.*! EEMS statistically measures effective 
migration rates by overlaying a dense grid of evenly spaced 
demes and calculating genetic differentiation (i.e., resis- 
tance distance) between neighboring demes. Higher rates 
of migration are inferred in locations where genetic simi- 
larity is high (colored in blue in Figure 3) while lower rates 
of migration are inferred in locations where genetic simi- 
larity is low (colored in dark orange). Areas with low effec- 
tive migration are also referred to in EEMS as “barriers,” 
which can be intuitively interpreted as regions in which 
neighboring populations are more genetically dissimilar 
than expected. In more homogeneous populations, these 
barriers tend to indicate isolation by distance, while in 
more heterogeneous populations, they may reflect differences 
in population structure. We applied EEMS to 
genetically classified Europeans, African Americans, and 
Hispanics/Latinos across the contiguous 48 states. We 
excluded East Asians and South Asians due to low sample 
density. 
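
To make the link between effective migration and genetic dissimilarity more concrete, the sketch below treats a one-dimensional chain of demes as resistors in series, so that the resistance distance between two demes is the sum of 1/m over the edges separating them. This is a deliberately simplified illustration and not the EEMS model itself, which fits rates on a two-dimensional grid. The function names and rates are hypothetical, and base-10 logarithms are assumed because the Figure 3 caption equates log(m) = 1 with a 10-fold difference.

```python
def relative_rate(log10_m):
    """Convert a log10 relative migration rate into a fold change versus the average."""
    return 10.0 ** log10_m


def chain_resistance(rates, i, j):
    """Resistance distance between demes i and j on a linear chain.

    rates[k] is the migration rate on the edge joining deme k and deme k + 1.
    Edges in series add their resistances (1 / rate), so a single low-rate
    edge (a "barrier") inflates the distance between demes on either side.
    """
    low, high = sorted((i, j))
    return sum(1.0 / rates[k] for k in range(low, high))


# Five hypothetical demes joined by four edges.
uniform_rates = [1.0, 1.0, 1.0, 1.0]   # no barrier
barrier_rates = [1.0, 0.1, 1.0, 1.0]   # 10-fold lower migration on one edge

distance_uniform = chain_resistance(uniform_rates, 0, 4)
distance_barrier = chain_resistance(barrier_rates, 0, 4)
```

In this toy chain, lowering one edge to a tenth of the average rate more than triples the distance between the end demes, which is the sense in which a barrier marks neighbors that are more dissimilar than their geographic separation alone would suggest.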

The inferred migration rates for African Americans reveal 
genetic signatures of historical demographic events (Fig- 
ures 3A, S8, and S9). Along the Atlantic coast from the 
Florida Panhandle to southern Maine, genetic similarity 
and effective migration rates are relatively high, indicating 
the constant migration and similar effective population 
sizes of African Americans in these states. However, 
we also observe a strong north-south barrier to migra- 
tion starting along the Appalachian Mountain Range, 
continuing north up the Mississippi River, and extending 
west across the rest of the country. This migration barrier, 
along with the migration barrier spanning Texas and New 
Mexico, reveals a pattern of genetic relatedness across 
geography that is consistent with the Great Migration from 
the 1910s to the 1960s in which an estimated 6 million 
African Americans migrated out of the South to cities 
across the Northeast, Midwest, and West.”°° To understand 
whether this migration barrier is influenced by sampling 
bias, we subsampled individuals and simulated three 
different sampling schemes (Material and Methods). We 
found that the north-south migration barrier was 
consistently present in all three sampling schemes, confirming 
that the inferred migration results of EEMS are robust to 
irregular sampling (Figure S10).*! 

Table 1. Summary of Haplotype Clusters 

Cluster                            Samples   Median Sum of   Median Cumulative
                                   (Count)   ROH (Mb)        IBD (cM)
Northwest Europe 1                 11,725    2.88            15.23
Northwest Europe 2                 1,571     2.80            15.15
Ireland                            2,137     2.85            15.42
Central Europe                     3,116     2.83            15.06
Eastern Europe                     2,471     3.16            15.37
Southern Europe                    1,626     2.73            14.98
Italy                              697       6.91            14.64
Greece-Italy                       238       7.28            15.02
Scandinavia                        717       3.02            15.54
Finland                            314       3.67            17.50
Acadia                             249       3.89            19.48
French Canadian                    314       2.89            16.60
Ashkenazi Jewish                   1,475     11.26           31.75
Admixed Jewish                     445       2.75            15.50
Hispanics/Latinos                  810       3.53            16.38
Hispanics/Latinos in California    573       4.10            17.11
Hispanics/Latinos in New Mexico    163       5.52            21.92
Hispanics/Latinos in Texas         177       6.27            23.65
Puerto Rico                        350       8.01            26.23
African Americans South            761       3.34            19.56
African Americans North            420       2.94            15.90
East Asia                          561       3.65            19.63
Southeast Asia                     325       8.44            17.90
South Asia                         389       10.42           14.82
Greater Middle East                93        9.01            17.16

Sum of runs of homozygosity (sROH) was calculated by summing the lengths of continuous homozygous segments >1 Mb. Cumulative IBD was determined by summing IBD segments of >3 cM and filtering for only pairs >12 cM and <72 cM. Statistics were determined within haplotype clusters, rather than across the ancestrally heterogeneous and imbalanced full network. 
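
The two summary statistics defined in the Table 1 footnote can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the thresholds come from the footnote, while the function names, the input layout (lists of segment lengths), and the toy values are hypothetical.

```python
from statistics import median


def sum_roh_mb(segment_lengths_mb, min_segment_mb=1.0):
    """sROH for one individual: total length of homozygous runs longer than 1 Mb."""
    return sum(length for length in segment_lengths_mb if length > min_segment_mb)


def cumulative_ibd_cm(segment_lengths_cm, min_segment_cm=3.0,
                      min_pair_cm=12.0, max_pair_cm=72.0):
    """Cumulative IBD for one pair of individuals.

    Segments of 3 cM or less are ignored. The pair is kept only if its total
    lies strictly between 12 cM and 72 cM; otherwise None is returned.
    """
    total = sum(length for length in segment_lengths_cm if length > min_segment_cm)
    return total if min_pair_cm < total < max_pair_cm else None


# Hypothetical segment lengths for three individuals within one cluster.
roh_by_individual = {
    "ind1": [0.5, 1.5, 2.0],
    "ind2": [1.25, 0.75],
    "ind3": [4.0, 2.5],
}
ibd_by_pair = {
    ("ind1", "ind2"): [2.0, 5.0, 8.5],   # total 13.5 cM, kept
    ("ind1", "ind3"): [4.0, 5.0],        # total 9.0 cM, below 12 cM
    ("ind2", "ind3"): [40.0, 40.0],      # total 80.0 cM, above 72 cM
}

median_sroh = median(sum_roh_mb(segs) for segs in roh_by_individual.values())
kept_pairs = [total for total in
              (cumulative_ibd_cm(segs) for segs in ibd_by_pair.values())
              if total is not None]
median_ibd = median(kept_pairs)
```

Taking medians within a cluster, as in the footnote, keeps a few very closely related or highly homozygous individuals from dominating the summary.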


A highly complex pattern of genetic similarity exists 
among present-day Hispanics/Latinos across the country, 
capturing regional genetic structure. Across the south- 
western states, two regions bordering Mexico—one in Cal- 
ifornia and another extending from New Mexico to 
Texas—exhibit high levels of genetic similarity and effec- 
tive migration rates (Figures 3B, S8, and S9). Separated by 
a migration barrier in Arizona, these two distinct regions 
likely reflect known differences in the northward migra- 
tion from east versus west Mexico.*”' High genetic similar- 
ity and relative rates of effective migration are also 
observed in Florida and continue northward. However, 
barriers to migration are observed in states immediately 
east of the Mississippi River, likely resulting from varying 
degrees of admixture. 

The patterns of genetic similarity for Europeans capture 
subcontinental structure. With the exception of the states 
in the Midwest and along the Atlantic coast, elevated levels 
of genetic similarity and relative migration rates are 
observed across most of the country. We find low effective 
migration rates surrounding Minnesota and Michigan, 
likely due to the genetic dissimilarity of Finnish and Scan- 
dinavian ancestry that is abundant in the region (Figures 
3C, S8, and S9).° We also find reduced migration rates 
across Ohio, West Virginia, and Virginia, suggesting the ex- 
istence of genetic differentiation along the Appalachian 
Mountains. Many of the major cities, such as Washington, 
DC, Philadelphia, and Miami, also exhibit low genetic sim- 
ilarity, perhaps due to greater genetic diversity and admix- 
ture within cities. 


Coupling Fine-Scale Haplotype Clusters and 
Multigenerational Birth Records Uncovers Distinct 
Subcontinental Structure 

To disentangle more recent and subtle population struc- 
ture, we performed identity-by-descent (IBD) clustering 
on the Genographic cohort and annotated clusters using 
multigenerational self-reported birth origin data. We first 
built an IBD network from pairwise IBD sharing among 
31,783 unrelated individuals, where vertices represent in- 
dividuals and edges represent the cumulative IBD (in cen- 
timorgans, cM) between pairs of individuals. We employed 
the Louvain method, a greedy heuristic algorithm, to 
recursively partition vertices in the graph into clusters 
that maximize modularity for each iteration.*’** The clus- 
ters of individuals resulting from each iteration can be in- 
terpreted as having greater amounts of cumulative IBD 
shared between individuals within the cluster than with 
those outside of the cluster. To aid in the interpretation 
of the clusters, we merged clusters with low genetic 
differentiation (FST < 0.0001), resulting in a final set of 25 
clusters (Table 1). We annotated each cluster based on ancestral 
birth origin and ethnicity data and constructed a 
neighbor-joining tree based on the FST values (Figure S11). 98% of 
the 3,028 individuals that were not classified by our 
Random Forest model were assigned to a haplotype cluster. 
No single cluster was overrepresented by unclassified 
Figure 4. Geographical Distribution of Hispanic/Latino Haplotype Clusters 

(A) Each dot corresponds to a county containing present-day individuals and the size of the dot signifies the number of samples of the particular cluster in that county. Only the Hispanic/Latino cluster with the highest odds ratio is shown for each county, and for clarity, only the top ten locations with the highest odds ratio are shown for each cluster. Maps showing the full distribution for each haplotype cluster can be found in the supplement (Figure S15). 

(B) Ancestral birth origin proportions of each cluster for individuals with complete pedigree annotations, up to grandparent level. Proportions were calculated from aggregating the birth locations of all grandparents corresponding to members of each haplotype cluster. For each chart, only the top five birth origins are shown as individual proportions; the remaining birth origins are aggregated into one slice (lightest color). 

(C) Ternary plots of ancestry proportions based on local ancestry inference for each haplotype cluster. Each dot represents one individual. 

individuals, as unclassified individuals comprised 8%- 
11% of each cluster. 
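
The Louvain method described above searches for partitions that maximize modularity. The sketch below does not implement Louvain; it only evaluates the weighted modularity of a given partition, which is the quantity being optimized. The function name, the edge-dictionary layout, and the toy IBD weights are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity Q of a partition of an undirected graph.

    edges:     {(u, v): weight}, each undirected pair listed once
               (here, weight stands for cumulative IBD in cM).
    community: {node: community label}.
    Q = sum over communities of [internal weight / m - (degree sum / 2m)^2],
    where m is the total edge weight.
    """
    total_weight = sum(edges.values())
    degree = defaultdict(float)
    internal = defaultdict(float)
    for (u, v), weight in edges.items():
        degree[u] += weight
        degree[v] += weight
        if community[u] == community[v]:
            internal[community[u]] += weight
    community_degree = defaultdict(float)
    for node, node_degree in degree.items():
        community_degree[community[node]] += node_degree
    two_m = 2.0 * total_weight
    return sum(2.0 * internal[label] / two_m - (total / two_m) ** 2
               for label, total in community_degree.items())


# Two hypothetical groups of three individuals with strong IBD sharing
# inside each group and one weaker link between the groups.
toy_edges = {
    ("a", "b"): 30.0, ("b", "c"): 30.0, ("a", "c"): 30.0,
    ("d", "e"): 30.0, ("e", "f"): 30.0, ("d", "f"): 30.0,
    ("c", "d"): 15.0,
}
split = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
merged = {node: 1 for node in split}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
```

The two-group partition scores higher than the single merged group, matching the interpretation in the text that individuals within a cluster share more cumulative IBD with each other than with those outside it.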

Genetic and geographic differences are greatest among 
Hispanic/Latino haplotype clusters. We identified a total 
of five Hispanic-related clusters (Figure 4). The largest of 
these clusters (n = 810; orange in Figure 4A) is strongly 
associated with south Florida (OR = 10.4; p = 2.5 x 10-2; 
Table S5) but is also found in California and Texas (OR > 2; p 
< 0.05; Table S5). No single ancestral birthplace character- 
izes this cluster, as the US, Mexico, and Cuba each make up 
more than 10% of the birth origin labels (Figure 4B). Pro- 
portions of European ancestry tracts inferred with 


RFMix** are higher in this cluster (mean = 72.7%, SD = 
20.4%; Figure 4C) than in the other Hispanic/Latino clus- 
ters (mean = 48.0%-67.4%; Figure 4C). Puerto Ricans char- 
acterize a substantial proportion of another Hispanic/ 
Latino cluster associated with Florida (OR > 4) as well as 
New York City (OR > 5). Unlike the other Hispanic clusters, 
the Puerto Rican cluster shares the same branch on the FST 
tree as the African American clusters (Figure S11), likely 
due to relatively high proportions of African ancestry 
(mean = 11.2%, SD = 9.0%) among Puerto Ricans. Median 
lengths of sROH and cumulative IBD in Puerto Ricans are 
also the highest among the Hispanic clusters (8.01 Mb 


380 The American Journal of Human Genetics 106, 371-388, March 5, 2020 


and 26.23 cM, respectively; Table 1). Consistent with other 
studies,*°°* we found evidence of a strong bottleneck 
in Puerto Ricans approximately 9-14 generations ago 
(Figure S12), coinciding with the colonization of America 
and likely explaining the elevated levels of IBD and sROH. 
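
The geographic associations reported for each cluster are expressed as odds ratios (OR) with p values. As a hedged sketch, assuming a 2 x 2 table of cluster membership by location and a one-sided Fisher's exact test like the one reported earlier for regional ancestry proportions, the calculation could look as follows. The function names and the counts are hypothetical, and the test actually used for Table S5 is not stated in this section.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2 x 2 table [[a, b], [c, d]].

    a: in cluster, in location      b: in cluster, elsewhere
    c: not in cluster, in location  d: not in cluster, elsewhere
    Raises ZeroDivisionError if b * c == 0.
    """
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p value for enrichment (OR > 1).

    Sums hypergeometric probabilities of seeing a or more cluster members
    in the location, holding all row and column totals fixed.
    """
    row_total = a + b          # individuals in the cluster
    col_total = a + c          # individuals in the location
    n = a + b + c + d
    upper = min(row_total, col_total)
    favorable = sum(comb(row_total, k) * comb(n - row_total, col_total - k)
                    for k in range(a, upper + 1))
    return favorable / comb(n, col_total)


# Hypothetical counts (not taken from Table S5).
toy_or = odds_ratio(8, 2, 2, 8)
toy_p = fisher_exact_greater(8, 2, 2, 8)
```

An odds ratio above 1 with a small p value indicates that cluster members are found in the location more often than expected from the marginal totals.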

Three distinct clusters of Hispanics/Latinos were found 
in the Southwest (Figure 4A): one strongly associated 
with New Mexico (OR > 4; p < 0.05), another primarily 
in Texas (OR > 3; p < 0.05), and the third associated 
with Southern California (OR > 2; p < 0.05). Combined 
with the EEMS analysis, these clusters confirm our observa- 
tion of parallel migration routes from east and west Mexico 
into the Southwestern United States. While genetic 
differentiation between these three clusters is subtle (FST = 0.001-0.003), 
comparison of the ancestral birth origin patterns 
and local ancestry proportions of these clusters reveals 
meaningful differences in their population history. 
Whereas the majority of Hispanics/Latinos in New Mexico 
report US ancestral origins, the recent ancestors of His- 
panics/Latinos in Texas are predominantly from Mexico. 
Nonetheless, these two clusters share similar local ancestry 
proportions with only slight genetic dissimilarity that 
results in a moderate decrease in migration rate (from darker 
blue to light blue in Figure 3B). Unlike the Hispanic/Latino 
clusters associated with New Mexico and Texas, the 
Hispanics/Latinos in California cluster contains greater 
proportions of ancestors from Central and South America (e.g., 
Colombia and El Salvador). Proportions of Native American 
ancestry (Figure 4C) and effective population size 
(Figure S12) are also higher in this cluster, but median 
cumulative IBD and sROH length are shorter (Table 1), 
similar to Central/South Americans found in New York 
City.’ Taken together, these two differences further 
explain the presence of the migration barrier in Arizona be- 
tween Hispanic/Latino individuals in California and those 
in New Mexico. 

Historical immigration of Europeans into the US 
occurred in successive waves, with northern and western 
Europeans making up one wave from the 1840s to 1880s 
and another wave comprised of southern and eastern Euro- 
peans occurring from the 1880s to 1910s.°* Consistent 
with this immigration pattern, haplotype clusters with an- 
cestries from northwest and central Europe have higher 
proportions of US ancestral birth origins than haplotype 
clusters from southern and eastern Europe, suggesting 
earlier immigration (Figures 5A and 5B). The two clusters 
with the highest proportion (>75%) of US ancestral birth 
origin (“Northwest Europe 1” and “Northwest Europe 2”) 
have ~4.5% of UK ancestral origins. The central European 
cluster and the Irish cluster have 66.1% and 68.5% of 
US ancestral origins, respectively (Figure 5B). In contrast, 
the US makes up only 62.2% and 34.5% of ancestral birth 
origin for the clusters of southern Europeans and eastern 
Europeans, respectively. 

Unlike the larger European clusters, the smaller Euro- 
pean clusters reflect the structure of recent immigrants 
and genetically isolated populations, recapitulating earlier 


findings. The geographic distributions of these subpopu- 
lations are more concentrated, and their ancestral birth 
origin proportions are overrepresented by specific coun- 
tries and ethnicities (Figures 6A and 6B). Specifically, Finns 
and Scandinavians are abundant in the Upper Midwest 
and Washington; French Canadians are found in the 
Northeast; Acadians are present in the Northeast and Loui- 
siana; and Italians, Greeks, and Jews are mostly located in 
the metropolitan area of New York City (Figure 6A). Of the 
European clusters, median cumulative IBD sharing and 
sROH lengths are highest among Ashkenazi Jews 
(31.8 cM and 11.3 Mb, respectively; Table 1), reflective of 
past founding events and endogamy.”'* The two Jew- 
ish-related clusters were identified using self-reported 
ancestral ethnicity data rather than birth origin data, since 
Jewish ancestry is not specific to any single location. Jew- 
ish ancestry, particularly Ashkenazi Jewish ancestry, is 
more consistently reported on both sides of the family in 
the larger cluster, while individuals in the smaller cluster 
more commonly reported Jewish ancestry on only one 
side of the family, suggesting the presence of admixture 
with non-Jewish ancestries. Therefore, the larger cluster is 
labeled “Ashkenazi Jewish” and the smaller cluster is 
labeled “Admixed Jewish.” 

We inferred two haplotype clusters of African Americans 
separated along a north-south cline, recapitulating the 
EEMS migration barrier inference. One cluster is primarily 
distributed among the northern and western states (“Afri- 
can Americans North”), while the other is distributed 
among the states southeast of the Appalachian Mountains 
(“African Americans South”) (Figure S13). The proportion 
of US birth origin is higher in the northern cluster than 
the southern cluster, providing further evidence of isola- 
tion-by-distance among African Americans in the north.’ 
These two clusters share similar sROH lengths but differ in 
admixture proportions and median IBD sharing (Table 1), 
pointing to a cluster with consistent African American an- 
cestors and a cluster with more admixed ancestors. Median 
cumulative IBD sharing is higher among African Americans 
in the south (median cumulative IBD = 19.6 cM, median 
sROH = 3.3 Mb) than in the north (median = 15.9 cM; 
Table 1), resulting in different patterns of effective population 
size over antecedent generations (Figure S12),*°*° 
while the average proportion of African ancestry is higher 
in the northern cluster than the southern cluster. 

Four of the clusters reflect recent immigration from Asia 
(Figure S14), which grew rapidly in the mid-20th century after 
the elimination of national origin quotas.” The recency 
of immigration among these clusters is supported by the 
observation that fewer than 30% of grandparents were 
born in the US. Geographically, individuals in these clusters 
primarily reside in major cities. East Asians predominantly 
inhabit the metropolitan areas of the west and northeast 
(OR > 2), Southeast Asians are enriched in the west (OR > 
2.5), and South Asians are strongly associated with the 
northeast (OR > 2.5). Despite its small size, the cluster of 
Greater Middle East individuals reflects many of the known 
Figure 5. Geographical Distribution of European American Haplotype Clusters 

(A) Each dot represents a county containing present-day individuals. The size of the dot represents the number of individuals of the particular cluster in that county. For each cluster, the top 20 locations with the highest odds ratio are shown. Maps showing the full distribution for each cluster can be found in the supplement (Figure S16). 

(B) Ancestral birth origin proportions for each cluster in (A). Only individuals with complete pedigree annotations, up to grandparent level, are included. For each chart, only the top five birth origins are shown as individual proportions; the remaining birth origins are aggregated into one slice (lightest color). 


demographic patterns of Arab Americans, as individuals in 
this cluster are primarily of Lebanese origin and are distrib- 
uted in the northeast as well as metropolitan Detroit. sROH 
lengths are particularly long for South Asians (median 
sROH = 10.3 Mb; Table 1), Southeast Asians (median 
sROH = 7.8 Mb), and Middle Easterners (median sROH = 
8.2 Mb), potentially reflecting patterns of consanguinity 
and inbreeding in their ancestral regions.°° In particular, 
the median sROH length in the South Asia cluster is the sec- 
ond highest among all clusters, but the median cumulative 
IBD length is similar to most clusters (Table 1). The popula- 
tion of South Asia is large and diverse, with many endoga- 
mous groups making up the 1.5 billion people living in 
the region.’””°* The pattern of IBD and sROH among 
individuals in the South Asian cluster thus may reflect the result 
of recent consanguinity in a large population.**”” 


Discussion 


As the US population is becoming increasingly diverse, 
genomic studies are simultaneously growing in scale 


and relevance; to increase scientific and ethical parity, 
these studies must move beyond the current practice of 
evaluating genetically homogeneous groups in isola- 
tion.*””°° Here, we provide an integrative framework for 
analyzing population structure in ancestrally heteroge- 
neous individuals. Our comprehensive approach has al- 
lowed us to capture spatial patterns of gene flow within 
and between subpopulations that are difficult to infer 
from a single method alone. For example, while EEMS 
enabled us to examine genetic similarity at a finer scale 
than previous studies and identify genetic differentiation 
within a state, EEMS can only compare neighboring 
demes and does not directly evaluate the genetic similar- 
ity of geographically distant individuals. Haplotype clus- 
tering, on the other hand, can identify population 
structures over long distances, but it does not measure 
genetic similarity with respect to geography. Since 
individuals are exclusively assigned to a single cluster, 
information regarding admixture, especially between 
neighboring clusters, is lost during haplotype 
clustering. An integrative approach can thus enable greater 
insights into populations with complex histories, as 
Figure 6. Geographical Distribution of Genetically Differentiated European American Haplotype Clusters 

(A) Similar to Figure 5A but corresponding to European populations that are more genetically isolated. For clarity, the top ten locations with the highest odds ratio are shown for each cluster. Full distributions for each cluster can be found in the supplement (Figure S17). 
(B) Ancestral birth origin proportions for each cluster in (A). Only individuals with complete pedigree annotations, up to grandparent level, are shown. For each chart, only the top five birth origins are shown as individual proportions; the remaining birth origins are aggregated into one slice (lightest color). 


well as populations typically overlooked in previous 
studies such as Asian Americans. 

The genetic structure and history of Hispanic/Latino 
populations are particularly complex due to many historical 
migration and admixture events.*”” This complexity is re- 
flected in the variable migration rates across the country 
and the large variations in admixture proportions within 
and between subpopulations. While prior analysis of His- 
panics/Latinos in the US found differences in ancestry pro- 
portions aggregated at the state level,’ we demonstrate that 
considerable differences in genetic ancestry also exist 
within a state. For example, two distinct clusters—Puerto 
Rico and Hispanics/Latinos—are found in Florida, with 
the Puerto Rico cluster having higher average African 
ancestry proportions than the Hispanics/Latinos cluster 
(9.0% versus 2.5%, respectively). EEMS also enabled direct 
measures of genetic similarity within states and between 
subpopulations. While the mean ancestry proportions 
are similar between the New Mexican cluster and the 
Texan cluster, individuals in northern New Mexico are 
more genetically differentiated than individuals in south- 
ern New Mexico, as indicated by the migration barrier. 
The individuals in northern New Mexico are likely 
Nuevomexicanos, descendants of Spanish colonial settlers, while 
those in the south are more genetically similar to 
Hispanic/Latino individuals in central Texas, likely 
because they share a common ancestral origin (i.e., 
Mexico). We also built upon the use of pedigree annota- 
tion® by quantifying ancestral origins to better understand 
the differences in genetic ancestry between subpopula- 
tions. For example, in the Hispanics/Latinos in California 
cluster, the mean proportion of European ancestry is 
smaller when compared to the New Mexican and Texan 
clusters, reflecting the lower proportion of US ancestral 
origins. Comparison of sROH and IBD lengths of these
clusters further reveals evidence of founder effects. Puerto
Ricans, Hispanics/Latinos in Texas, and Hispanics/Latinos 
in New Mexico had the highest median IBD lengths and 
showed evidence of a recent bottleneck (Figure S12), consistent
with prior studies.*°”°* In general, median sROH and
IBD lengths were higher in Hispanic-related clusters than 
European clusters, reflecting the patterns found in refer- 
ence populations’°*’®! and in line with recent findings in 
New York City.°* 
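
The per-cluster comparison of median sROH described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: it takes sROH to mean the summed length of an individual's runs of homozygosity, and it assumes that ROH segments have already been called and are supplied as (individual, cluster, length in Mb) tuples. The function name, the input layout, and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_sroh_by_cluster(segments):
    """Return {cluster: median sROH in Mb} from ROH segment records.

    `segments` is an iterable of (individual_id, cluster, length_mb) tuples.
    Individuals with no called ROH segment do not appear in the input and
    are therefore not counted (an assumption of this sketch).
    """
    sroh = defaultdict(float)   # individual -> summed ROH length
    cluster_of = {}             # individual -> cluster label
    for individual, cluster, length_mb in segments:
        sroh[individual] += length_mb
        cluster_of[individual] = cluster

    per_cluster = defaultdict(list)
    for individual, total in sroh.items():
        per_cluster[cluster_of[individual]].append(total)

    return {cluster: median(totals) for cluster, totals in per_cluster.items()}


# Made-up example values, for illustration only.
example_segments = [
    ("a", "X", 2.0), ("a", "X", 3.0), ("b", "X", 1.0),
    ("c", "Y", 10.0), ("d", "Y", 4.0), ("d", "Y", 4.0), ("e", "Y", 1.0),
]
example_medians = median_sroh_by_cluster(example_segments)
```

The same aggregation applies to IBD segment lengths if the tuples describe shared segments instead of ROH segments.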

The demographic history of African Americans is 
characterized by large-scale migration and admixture, 
primarily due to the transatlantic slave trade and racial
segregation.°°’” The patterns of genetic ancestry and relat- 
edness between states and regions of the US reflect these 
events.™””? Our results show, at a finer scale, the barriers 
to migration and gene flow, particularly along the Appala- 
chian Mountains. This migration barrier overlaps with the 
boundary between slave states and free states, as well as the 
boundary between states that enacted laws enforcing racial 
segregation and states that forbade segregation. The north-south
separation of the two African American clusters further
emphasizes this divide. The African Americans South cluster
contains more recent ancestors from outside the US,
particularly from the Caribbean, than the African Ameri- 
cans North cluster. These insights further emphasize the 
impact of historical migration and socioeconomic divide 
on the present-day patterns of genetic relatedness among 
African Americans. 

Despite accounting for more than 5% of the US population,
individuals with Asian ancestries are underrepresented
in US population genetics studies, hindering the
ability of prior studies to investigate their ancestry.”
Our analyses of these individuals therefore provide new in- 
sights into their genetic structure. Many of these individ- 
uals are descendants of recent immigrants, as indicated 
by the high proportions of non-US grandparental ancestral 
origin; therefore, they likely reflect the population of their 
ancestral region. The genetic structure of these individuals 
is particularly diverse. Using fineSTRUCTURE, genetic dif- 
ferentiation was found between East Asian and South 
Asian individuals of different ancestral origin as well as be- 
tween individuals with the same ancestral origin. At the 
same time, longer sROH was observed in the Southeast 
Asia, South Asia, and Greater Middle East haplotype clus- 
ters, likely reflecting consanguinity or endogamy patterns 
in their ancestral countries. For example, the long sROH 
in South Asians may reflect endogamy related to the 
caste system in India, while similar patterns among the 
Middle Eastern and Southeast Asian clusters may be 
capturing consanguineous marriage practices in those 
regions.**’°*°* Understanding population genetic structure
and patterns of homozygosity is important in
determining the genetic profile of diseases within 
subpopulations, especially since these recent immigrants 
are becoming less similar to those in their ancestral coun- 
tries due to outbreeding, admixture, and population 
growth.°”’°° As populations mix, heterozygosity increases 
and allele frequencies change. This, in turn, can alter the 
prevalence of certain diseases, particularly rare recessive 
disorders that are often more prevalent in populations 
with increased homozygosity.°’ At the same time, changes 
in allele frequencies can also reduce the accuracy of genetic 
predictors of complex traits (i.e., polygenic risk scores), 
especially if the prediction model was built using a homo- 
geneous cohort of individuals from a divergent ancestry.°° 
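
The link between mixing, heterozygosity, and recessive disease prevalence can be made concrete with a short numerical sketch. It relies on standard population-genetic identities (Hardy-Weinberg proportions and the inbreeding-adjusted homozygote frequency q^2 + Fq(1 - q)), not on any analysis from this study. It assumes a single biallelic locus, two equally sized subpopulations that each mate randomly, and made-up allele frequencies; all function names are hypothetical.

```python
def expected_heterozygosity(p: float) -> float:
    """Hardy-Weinberg heterozygosity 2p(1 - p) for allele frequency p."""
    return 2.0 * p * (1.0 - p)


def recessive_genotype_frequency(q: float, f: float = 0.0) -> float:
    """Frequency of recessive homozygotes, q^2 + F*q*(1 - q).

    q is the recessive allele frequency and f the inbreeding coefficient;
    f = 0 gives the random-mating value q^2.
    """
    return q * q + f * q * (1.0 - q)


def mixing_summary(q1: float, q2: float) -> dict:
    """Compare two isolated, equally sized subpopulations with their
    randomly mating mixture (illustrative assumption)."""
    q_mixed = (q1 + q2) / 2.0
    return {
        "heterozygosity_before": (
            expected_heterozygosity(q1) + expected_heterozygosity(q2)
        ) / 2.0,
        "heterozygosity_after": expected_heterozygosity(q_mixed),
        "recessive_homozygotes_before": (
            recessive_genotype_frequency(q1) + recessive_genotype_frequency(q2)
        ) / 2.0,
        "recessive_homozygotes_after": recessive_genotype_frequency(q_mixed),
    }


if __name__ == "__main__":
    # Made-up frequencies: the recessive allele is at 2% in one group
    # and absent from the other.
    for name, value in mixing_summary(0.02, 0.0).items():
        print(f"{name}: {value:.6f}")
```

Under these assumptions, the drop in recessive homozygote frequency after mixing equals the variance in allele frequency between the two subpopulations. Passing a nonzero `f` to `recessive_genotype_frequency` shows the opposite effect of consanguinity or endogamy.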

Population history in the US is best characterized among 
individuals of European descent. Genetic diversity tends to 
be highest in more densely populated regions, likely due to
multiple populations living in the same place. Many of the
European subpopulations we identified are similar to those 
previously found (e.g., French Canadians, Acadians,
Scandinavians, and Ashkenazi Jews). The geographic distribution
of these subpopulations, particularly those that are
more genetically diverged, overlaps in the metropolitan
areas of the Northeast, Midwest, and California. These 
overlaps may explain the presence of certain EEMS-in- 
ferred migration barriers. For example, the migration 
barrier and lower genetic similarity encompassing metro- 
politan New York City may be explained in part by the 
large presence of Greeks, Italians, and Ashkenazi Jews in 
that area. 

The precision of population labels assigned to clusters of 
individuals is a function of demographic complexity and 
sample size. For example, Finnish ancestry is clearly Euro- 
pean but genetically distinct from several other European 
populations due to historical bottlenecks, making this 
ancestry cluster relatively easily separable. By contrast, 
most Americans of European descent have heterogeneous 
ancestors from several northwestern European countries 
who have admixed over time, resulting in relatively evenly 
distributed ancestry overlapping that of present-day Euro- 
peans from multiple primarily northwestern countries. 
Additionally, while we identify and describe some substan- 
tial structure among Hispanic/Latino populations, consid- 
erably more is likely to exist and remains to be learned 
from larger and more diverse future studies. Similarly, 
sub-regional resolution of the ancestry of recent Asian
immigrants has been relatively limited in population genetics
studies, and the finer structure of these populations remains
to be characterized in larger future studies. Interestingly, we
found that fineSTRUCTURE was able to disentangle Asian 
subpopulations at a finer resolution than haplotype clus- 
tering, demonstrating the tradeoff between resolution 
and scale of these two methods and further highlighting 
the value of an integrated approach. The accuracy of self-reported
birth records and the variable granularity of geopolitical
boundaries are additional considerations
regarding the precision of population labels.

In addition to being of anthropological interest, under- 
standing fine-scale human history and its role in shaping 
genetic variation is also important for interpreting the ge- 
netic basis of biomedical traits. The emergence of biobank- 
scale genomic data is enabling the imputation of pedigree 
structure regardless of whether some relatives have 
contributed DNA,°®° greater insights into the impact of 
fine-scale population structure on genetic associations 
with disease, '*'7?7°°°? and population-based screening 
for individuals with serious genetic and health-related as- 
sociations.’° Standard practice in genetic studies to date 
has involved identifying the largest genetically homoge- 
neous population in a study (typically European ancestry) 
and conducting genetic analysis excluding other popula- 
tions.’ '’’* However, as genetic studies become increasingly 
promising for clinical translation, this practice has led 
to concerns about genetic tools exacerbating health 
disparities, particularly for populations underrepresented
in genetic studies.*”’°° Participation in genetic programs 
is increasing in the US, for example with the All of Us 
Research Program (Web Resources) or with direct-to-con- 
sumer genetic tests that an estimated 26 million people 
have taken, and many of these participants are of diverse 
non-European ancestry.’’’* As a result, the need for 
including more diverse populations in genetic studies 
and for inferring more granular demographic histories in 
diverse study cohorts is becoming greater. Understanding 
such structure is important to account for stratification 
in association studies, prevent the overgeneralization of 
potentially confounded results, and avoid exacerbating 
existing Eurocentric study biases.°”’'’>'’° This study 
demonstrates how genetic data can be coupled with 
geographic and birth origin data to reconstruct such demo- 
graphic histories, particularly in a large and heterogeneous 
population. 


Data and Code Availability 


Genotype data and associated metadata are available to re- 
searchers through an application process and data usage 
agreement. We encourage qualified researchers to email 
the Genographic team at National Geographic Society 
(genographic@ngs.org) for information on and access 
to the Genographic database. For more information, 
please visit the Genographic Project website
(https://genographic.nationalgeographic.com/for-scientists/).
Custom scripts generated to analyze the data in this paper
are available through GitHub
(https://github.com/chengdai/genographic_ancestry).